
    Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

    The objective of the research in this dissertation is to derive optimal search schemes for approximate string matching using a bidirectional FM-index, and to use them to speed up such searches. This problem arises in many computer science applications. The approximate string matching problem is also central in bioinformatics, where biologists are interested in aligning pieces of DNA back to a genome. Given a text, the search for a given pattern can be accelerated by preprocessing the text, either by constructing a hash table or by indexing the text. Bidirectional indices have opened new possibilities by allowing a search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Prior work tends to use search heuristics but lacks the ability to find the best strategies for using an index to search for a pattern. In this dissertation, we will find the optimal search scheme for the approximate string matching problem for a bidirectional index, assuming that the number of partitions is given. Moreover, we will investigate the computational gain from applying these optimal search schemes to search in a bidirectional FM-index. Intellectual Merit. First, we propose a mixed integer programming (MIP) formulation to find the optimal search scheme for the approximate string matching problem using a bidirectional index under Hamming distance error. Second, we demonstrate that our MIP can solve the optimum search scheme problem to optimality in a reasonable amount of time for input parameters of considerable size, and that it converges very quickly to optimal or near-optimal solutions for input parameters of larger size. 
Third, we show that approximate search in a bidirectional FM-index can be performed significantly faster if the optimal schemes obtained from our MIP are used. This is demonstrated based on the number of edges in the search tries as well as the actual running time of in-index search for Illumina DNA sequencing reads (up to 35 times faster than standard backtracking for 3 errors). Although our MIP solutions are for Hamming distance, they perform equally well for edit distance. Fourth, we demonstrate that our optimal search schemes are superior to the best in-index aligners for 2 and 3 errors. To get a glimpse of the potential of combining our optimal search schemes with in-text verification, we combine an optimal search scheme with in-text verification for Hamming distance. This experiment halved the running time for reads of size 101 and 125. Furthermore, we showcase the power of our optimal search schemes by demonstrating that, for 1 to 3 errors, approximate string matching of reads of size 40, 101, and 125 performed completely in the index competes in running time with the best full-fledged aligners, which benefit from combining search in the index with in-text verification for edit distance. Moreover, we will relax the assumption of equal-size partitions in our MIP and address the more general form of the approximate string matching problem, where the only assumption is the prespecified number of partitions. We will present an MIP formulation for edit distance and provide an alternative formulation for Hamming distance. Broader Impacts. The results of this research promise a significant increase in the speed of finding approximate occurrences of a pattern in a text. This is an important problem with many applications in bioinformatics and computer science, such as recovering text in signal processing and information retrieval [23]. 
Approximate string matching plays an indisputable role in bioinformatics, where any downstream analysis of genomic data starts with aligning sequenced DNA or RNA reads back to a reference genome. Technologies such as next generation sequencing have produced a considerable amount of data, leading to increasing demand for fast read aligners that map DNA pieces to a genome. To solve this central problem, one can consider the genome of any species of interest as the "text" and the sequenced pieces of DNA as the "patterns", and then search for approximate occurrences of a pattern in the text using a full-text index. Some tolerance for errors is required because of mutations in the genome of each individual organism, such as single nucleotide variants (SNVs), as well as errors in sequencing technologies. This broad spectrum of applications indicates the significant impact of this research on many areas of health and life sciences and practice, where discovery, diagnosis, and treatment all depend on genome sequencing.
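
To make the "text" and "pattern" framing concrete, here is a minimal sketch of approximate matching under Hamming distance. It is a plain scan over the text, not the index-based search the abstract describes, and the function name `hamming_occurrences` and its parameters are assumptions made for illustration only.

```python
from typing import List


def hamming_occurrences(text: str, pattern: str, max_mismatches: int) -> List[int]:
    """Return every start position where pattern aligns to text
    with at most max_mismatches substitutions (Hamming distance)."""
    hits = []
    m = len(pattern)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for j in range(m):
            if text[start + j] != pattern[j]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many errors at this alignment
        else:
            # The inner loop finished without exceeding the error budget.
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Toy "genome" and "read"; real inputs would be far longer.
    print(hamming_occurrences("ACGTACGT", "ACGA", 1))
```

This scan costs time proportional to the text length times the pattern length for every pattern, which is why the abstract preprocesses the text into an index instead.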

    Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

    Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard by allowing the search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in a bidirectional FM-index over previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to that of the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner that employs an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work.
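
The notion of a search scheme (pieces, an order in which to search them, and bounds on errors) can be illustrated with a small sketch. The code below assumes a search is a triple of piece order, lower bounds, and upper bounds on the cumulative number of mismatches. The two-piece scheme `SCHEME_K1` for at most one mismatch is a simple hand-written example, not an optimal scheme produced by the MIP, and a plain scan over the text stands in for the bidirectional FM-index. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

# A search: (order of pieces, lower bounds, upper bounds) on cumulative mismatches.
Search = Tuple[Sequence[int], Sequence[int], Sequence[int]]

# Illustrative scheme for at most 1 mismatch with 2 pieces (0-based piece ids).
SCHEME_K1: List[Search] = [
    ((0, 1), (0, 0), (0, 1)),  # piece 0 exact, then piece 1 with at most 1 mismatch in total
    ((1, 0), (0, 1), (0, 1)),  # piece 1 exact, then piece 0 with exactly 1 mismatch in total
]


def split_equal(m: int, pieces: int) -> List[Tuple[int, int]]:
    """Half-open [start, end) ranges splitting m characters into near-equal pieces."""
    bounds = [i * m // pieces for i in range(pieces + 1)]
    return list(zip(bounds, bounds[1:]))


def run_search(text: str, pattern: str, search: Search,
               ranges: List[Tuple[int, int]], start: int) -> bool:
    """Check one alignment against one search, visiting pieces in the given order."""
    order, lower, upper = search
    errors = 0
    for step, piece in enumerate(order):
        lo, hi = ranges[piece]
        for j in range(lo, hi):
            if text[start + j] != pattern[j]:
                errors += 1
                if errors > upper[step]:
                    return False  # upper bound violated: prune early
        if errors < lower[step]:
            return False  # lower bound violated: another search covers this case
    return True


def scheme_occurrences(text: str, pattern: str,
                       scheme: List[Search], pieces: int) -> List[int]:
    """Start positions accepted by at least one search of the scheme."""
    ranges = split_equal(len(pattern), pieces)
    return [
        start
        for start in range(len(text) - len(pattern) + 1)
        if any(run_search(text, pattern, s, ranges, start) for s in scheme)
    ]


if __name__ == "__main__":
    print(scheme_occurrences("ACGTACGT", "ACGA", SCHEME_K1, 2))
```

The lower bounds keep the two searches from covering the same error distribution twice: the first search handles a mismatch in the second piece (or none at all), and the second handles a mismatch in the first piece. In an actual index, each search would start with an exact match of its first piece, which is what keeps the number of explored branches small.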
